Preamble

This dataset was obtained from Kaggle here and includes the lat/lon coordinates of meteorite impact sites.
There 45,000 impact sites and in this analysis we will be analyzing the ones inside of the continential United States.

Data Cleaning

First we need to check for data integrity. If we count the NA’s we can see that they are most frequent in the lat/long, year, and then mass.

colSums(is.na(meteorites))
##        name          id    nametype    recclass        mass        fall 
##           0           0           0           0         131           0 
##        year      reclat     reclong GeoLocation 
##         288        7315        7315           0

We can confirm this visually by passing our dataset to a missing values map from the ‘Amelia’ package.

missmap(meteorites, 
        main = "Missingness Map of Meteorite Impact Dataset",
        legend = T, 
        y.labels = NULL, 
        y.at = NULL,
        col = c("#ff6961", "#84d9ff"))

A missingmap is a quick and easy way for us to gauge the missingness of our dataset however since it is visually based we can miss some details on a first glance, such as the missingniss along year and mass columns. If we run this code below we will get a raw count of the NA values per column.

sapply(meteorites, function(y) sum(length(which(is.na(y))))) %>% 
  data.frame() %>% 
  dplyr::rename(NA_COUNT = 1) %>% 
  dplyr::arrange(desc(NA_COUNT))
##             NA_COUNT
## reclat          7315
## reclong         7315
## year             288
## mass             131
## name               0
## id                 0
## nametype           0
## recclass           0
## fall               0
## GeoLocation        0

We can calculate the percentage of missing values by dividing the sum of NA’s by the product of rows multiplied by columns for all data points, which returns us 3.2%.

\[NA= \frac{\Sigma(NA)}{\Pi(Rows * Column)}\]

sum(is.na(meteorites))/prod(dim(meteorites))
## [1] 0.03291845

This number is low enough where we can simply filter these data-points out of our dataset and proceed without the need over imputation. We remove the missing values from our dataset and run a missing values map to confirm this as shown below.

# Filtering out missing values 
meteorites_clean <- meteorites %>% 
  dplyr::filter(!is.na(year) & !is.na(mass) & !is.na(reclat) & !is.na(reclong)) 

missmap(meteorites_clean, 
        main = "Missingness Map of Meteorite Impact Dataset",
        legend = T, 
        y.labels = NULL, 
        y.at = NULL,
        col = c("#ff6961", "#84d9ff"))

If we run summary() We can see below two (2) main issues:

summary(meteorites_clean)
##      name                 id          nametype           recclass        
##  Length:38116       Min.   :    1   Length:38116       Length:38116      
##  Class :character   1st Qu.:10832   Class :character   Class :character  
##  Mode  :character   Median :21733   Mode  :character   Mode  :character  
##                     Mean   :25343                                        
##                     3rd Qu.:39887                                        
##                     Max.   :57458                                        
##       mass              fall                year          reclat      
##  Min.   :       0   Length:38116       Min.   : 601   Min.   :-87.37  
##  1st Qu.:       7   Class :character   1st Qu.:1986   1st Qu.:-76.72  
##  Median :      29   Mode  :character   Median :1996   Median :-71.50  
##  Mean   :   15600                      Mean   :1990   Mean   :-39.59  
##  3rd Qu.:     187                      3rd Qu.:2002   3rd Qu.:  0.00  
##  Max.   :60000000                      Max.   :2101   Max.   : 81.17  
##     reclong        GeoLocation       
##  Min.   :-165.43   Length:38116      
##  1st Qu.:   0.00   Class :character  
##  Median :  35.67   Mode  :character  
##  Mean   :  61.31                     
##  3rd Qu.: 157.17                     
##  Max.   : 178.20

First, We can see some meteorites have a mass of 0 grams, this could be because the actual mass was so small it did not register past a decimal point. To correct this we can filter these objects out.

meteorites_clean <- meteorites_clean %>% 
  dplyr::filter(mass != 0)
  1. We can also see that our data starts at 601 and ends at 2101. We can assume these to be errors, and will need to filter out.

We will use 2016 as our ‘ceiling’ or the max point we want to find out. We can remove these rows and then make a density plot to see where our ‘floor’ should, that is to mean the lowest values we want filtered out.

meteorites_clean <- meteorites_clean %>% 
  dplyr::filter(year <= 2016)
 
density <- density(meteorites_clean$year)

plot_ly(x = ~density$x, y = ~density$y, type = 'scatter', 
        mode = 'lines', fill = 'tozeroy', 
        fillcolor = "#DB4437", 
        line = list(color = 'white')) %>%
   
layout(title = "Meteorite Impact Frequency over Time (600 - 2016)",
       hovermode = "x-unified",
       titlefont = list(family = "Agency FB", 
                        size = 25, 
                        color = '#ffffff'),
       font = list(family = "Agency FB", size = 15),
       margin = 10,
       paper_bgcolor='black',
       plot_bgcolor='black',
       xaxis = list(title = "Year", 
                    color = '#ffffff'),
       yaxis = list(title = "Density of Impacts", 
                    color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

We can see the number of meteorite impacts increases over time as both our understanding and technology improves, however we see a significant jump approx after 1960’s. With this in mind we’ll filter out for values before 1960, this choice is arbitrary admittedly but will give us a clearer picture of ‘recent’ meteorite impact sites in our research area.

meteorites_clean <- meteorites_clean %>% 
  dplyr::filter(year >= 1960)

Here is what our new distribution looks like as filtered from 1960 - 2016 (76 yrs). We can see three (3) definite peaks in our density distribution which all occur in the late years of the decades they are in (late 70’s,late 80’s, and late 90’s)

density <- density(meteorites_clean$year)

plot_ly(x = ~density$x, y = ~density$y, type = 'scatter', 
        mode = 'lines', fill = 'tozeroy', 
        fillcolor = "#4285F4", 
        line = list(color = 'white')) %>%
  
  layout(title = "Meteorite Impact Frequency over Time (1960 - 2016)",
         hovermode = "x-unified",
         titlefont = list(family = "Agency FB", 
                          size = 25, 
                          color = '#ffffff'),
         font = list(family = "Agency FB", size = 15),
         margin = 10,
         paper_bgcolor='black',
         plot_bgcolor='black',
         xaxis = list(title = "Year", 
                      color = '#ffffff'),
         yaxis = list(title = "Density of Impacts", 
                      color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

Next we need to filter out for this impacting the continental united states. To achieve this we’ll need to select the points for the north/south and east/west boundaries of the continental

# hacky method using the northern/southern, and eastern/western most coords to create a box to filter on 
meteorites_usa <- meteorites_clean %>% 
  dplyr::filter(reclat <= 49.3457868 & reclat >= 24.7433195) %>% #filter for north and south 
  dplyr::filter(reclong >=  -124.7844079 & reclong <= -66.9513812) # filter for east and west 

Now with out dataset properly cleaned and prepped we can move onto the Exploratory Data Analaysis (EDA) step.

Exploratory Data Analysis (EDA)

First let’s examine the frequency over time

Impacts_by_year <- meteorites_usa %>% 
  dplyr::group_by(year) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::ungroup() %>% 
  dplyr::arrange(year) %>% 
  data.frame()

plot_ly(Impacts_by_year, x = ~year, y = ~n, type = 'bar', 
        marker = list(color = '#ADFF2F'), 
        name = "") %>% 
  layout(hovermode = "x-unified",
         title = "Meteorite Impacts in America Over Time",
         titlefont = list(
           family = "Agency FB", size = 20, color = '#ffffff'),
         font = list(family = "Agency FB", size = 16),
         margin = 10,
         paper_bgcolor='black',
         plot_bgcolor='black',
         xaxis = list(title = "", 
                      color = '#ffffff'),
         yaxis = list(title = "No. of Impacts", 
                      color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

We’ll map out all 971 points on a map to see how the data appears.

Mass

Impacts by Mass

Here we have mapped out the impact sites, color coding and scaling the size of each point by the mass (g).

pal <- colorNumeric(palette = "RdYlBu", domain = meteorites_usa$mass, reverse = T)

leaflet(data = meteorites_usa) %>% 
        addFullscreenControl(position = "topright") %>% 
        addResetMapButton() %>% 
        addProviderTiles(providers$Stamen.TonerLite)%>% 
        addCircleMarkers(lng = ~reclong, lat = ~reclat,
                         color = ~pal(mass), 
                         radius = meteorites_usa$mass/1e5, 
                         stroke = FALSE, fillOpacity = 1) %>% 
        
        addLegend("bottomright", pal = pal, values = ~mass,
                  title = "Mass (grams) <hr>",
                  labFormat = labelFormat(suffix = " g"),
                  opacity = 1)

We can see some clustering along the border between New Mexico & Texas but mostly it’s sporadic and random everywhere else. We also see an impact in california that is massive relatively speakin, clocking in at a mass of 2,753,000 grams ~ 2,753 kg

Impact Mass Distribution

Below we have boxplotted the points to gauge the frequency. We can see that the mass clusters below 50K grams, with the median being ~300g.

plot_ly(meteorites_usa) %>% 
        add_boxplot(y = ~mass, jitter = 0.5, pointpos = -1.8, boxpoints = 'all', 
                    marker = list(color = '#36C5F0'), 
                    line = list(color = '#FFFC00'), 
                    name = "mass(g)") %>% 
        layout(yaxis = list(title = 'mass(g)')) %>% 
        
        layout(plot_bgcolor='black') %>% 
        layout(paper_bgcolor='black') 

Meteorites by Observed Status

In our dataset meteorites are coded via whether they were spotted on the ground after impact (“found”) or while in the mid-fall ("fell)

Overwhelmingly we can see most meteorites were spotted while on the ground after-impact was made, we have modified the size argument to help make the quantity more acute on our map.

pal <- colorFactor(c("#FF1C51", "#006AFF"), domain = c("Fell", "Found"))

labels <- c("Mid-Air", "Ground")

leaflet(data = meteorites_usa) %>% 
        addFullscreenControl(position = "topright") %>% 
        addResetMapButton() %>% 
        addProviderTiles(providers$Stamen.TonerLite) %>% 
        addCircleMarkers(lng = ~reclong, lat = ~reclat,
                         color = ~pal(fall), 
                         radius = ~ifelse(fall == "Found", 2, 4),
                         stroke = TRUE, fillOpacity = 0.6, weight = 1) %>% 
        
        addLegend("bottomright", pal = pal, values = ~fall,
                  title = "<u>Found Status</u>",
                  opacity = 1, 
                  labFormat = function(type, cuts, p){
                          paste0(labels)})

Distribution of Meteorites by Years

Here we can interestingly that the more recent impact sites are now impacting California/Arizona and that the datapoints along New Mexico /Arizona are mostly ‘older’ impact sites.

pal <- colorBin(
        palette = "RdBu",bins = 5,
        domain = meteorites_usa$year, reverse = T)


leaflet(data = meteorites_usa) %>% 
        addFullscreenControl(position = "topright") %>% 
        addResetMapButton() %>% 
        addProviderTiles(providers$Stamen.TonerLite) %>% 
        addCircleMarkers(lng = ~reclong, lat = ~reclat,
                         color = ~pal (year), 
                         radius = 4,
                         stroke = FALSE, fillOpacity = 1) %>% 
        
        addLegend("bottomright", pal = pal, values = ~year,
                  title = "<u>Year Found</u>",
                  labFormat = labelFormat(),
                  opacity = 0.9)

Impacts over Time

plot_ly(data = meteorites_usa,
        x = ~year,
        type = "histogram",
        cumulative = list(enabled=TRUE), 
        marker = list(color = '#006AFF'), 
        name = "Cumulative") %>%
        
        add_trace(data = meteorites_usa,
                  x = ~year,
                  type = "histogram", 
                  cumulative = list(enabled=F), 
                  marker = list(color = '#00ffff'),
                  name = "Per Year") %>% 
        
        layout(title = "Meteorite Impacts in America (1960 - 2016)",
               titlefont = list(
                       family = "Agency FB", 
                       size = 45, 
                       color = '#ffffff'),
               font = list(family = "Agency FB", 
                           size = 25),
               margin = 10,
               paper_bgcolor='black',
               plot_bgcolor='black',
               xaxis = list(title = "Year", 
                            color = '#ffffff'),
               yaxis = list(title = "No. of Impacts", 
                            color = '#ffffff'))
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

Meteorite Classifications

one of a large number of meteorite classifications based on physical, chemical, and other characteristics

Meteorites have a classification system based on the mineralogical, petrological, chemical, and isotopic properties of the meteorite. Here we have calculated the frequency for each class in our dataset and filtered for the top 10 most represented classes in our data. We visualize this distribution in a pie chart.

Our top 10 most frequent classes by number of impacts were: 1. H5 (202) 2. L6 (164) 3. H4 (125) 4. H6 (93) 5. L5 (92) 6. 0C (37) 7. L4 (36) 8. Iron, IIIAB (20) 9. CK4 (11) 10. LL6 (10)

# Create a frequency for our class variable 
class_freq_table <- meteorites_usa %>% 
        dplyr::group_by(recclass) %>% 
        dplyr::tally() %>% 
        dplyr::ungroup() %>% 
        dplyr::arrange(desc(n)) %>% 
        dplyr::top_n(n = 10, wt = n)


colors =  brewer.pal(10, "Spectral")

plot_ly(class_freq_table, labels = ~recclass, values = ~n, type = 'pie',
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        # height = 500, width = 910,
        
        marker = list(colors = colors,line = list(color = 'white', width = 2)),
        showlegend = FALSE, 
        hoverinfo = 'text',
        text = ~paste('<b>Meteorite Classification:</b>', recclass, "<br>", 
                      "<b>No. of Impacts:</b>", n, "</br>", 
                      "<b>No. of Impacts:</b>", round(n/sum(n)*100, 1),"%", "</br>")) %>% 
        
        layout(title = 'Meteorites Classes in USA (1960 - 2016)',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)) %>% 
        
        layout(titlefont = list(family = "Agency FB", 
                                size = 20, 
                                color = '#ffffff'),
               font = list(family = "Agency FB", 
                           size = 12),
               margin = 10,
               paper_bgcolor='black',
               plot_bgcolor='black')
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.